ABSTRACT 

Our objective is to sample the node set of a large unknown 
graph via crawling, to accurately estimate a given metric of 
interest. We design a random walk on an appropriately de- 
fined weighted graph that achieves high efficiency by prefer- 
entially crawling those nodes and edges that convey greater 
information regarding the target metric. Our approach be- 
gins by employing the theory of stratification to find opti- 
mal node weights, for a given estimation problem, under an 
independence sampler. While optimal under independence 
sampling, these weights may be impractical under graph 
crawling due to constraints arising from the structure of the 
graph. Therefore, the edge weights for our random walk 
should be chosen so as to lead to an equilibrium distribution 
that strikes a balance between approximating the optimal 
weights under an independence sampler and achieving fast 
convergence. We propose a heuristic approach (stratified 
weighted random walk, or S-WRW) that achieves this goal, 
while using only limited information about the graph struc- 
ture and the node properties. We evaluate our technique 
in simulation, and experimentally, by collecting a sample 
of Facebook college users. We show that S-WRW requires 
13-15 times fewer samples than the simple re-weighted ran- 
dom walk (RW) to achieve the same estimation accuracy for 
a range of metrics. 

1. INTRODUCTION 

Many types of online networks, such as online social net- 
works (OSNs), Peer-to-Peer (P2P) networks, or the World 
Wide Web (WWW), are measured and studied today via 
sampling techniques. This is due to several reasons. First, 
such graphs are typically too large to measure in their en- 
tirety, and it is desirable to be able to study them based on 
a small but representative sample. Second, the information 
pertaining to these networks is often hard to obtain. For ex- 
ample, OSN service providers have access to all information 
in their user base, but rarely make this information publicly 
available. 

There are many ways a graph can be sampled, e.g., by 
sampling nodes, edges, paths, or other substructures [23, 27]. 
Depending on our measurement goal, elements with different 
properties may have different importance and should be 
sampled with different probabilities. For example, Fig. 1(a) 
depicts the world's population, with residents of China (1.3B 
people) represented by blue nodes, of the Vatican (800 people) 
by black nodes, and all other nationalities represented by 
white nodes. Assume that we want to compare the median 
income in China and the Vatican. Taking a uniform sample of 
size 100 from the entire world's population is ineffective, 
because most of the samples will come from countries other 
than China and the Vatican. Even restricting our sample to the 
union of China and the Vatican will not help much, as our 
sample is unlikely to include any Vatican resident. In contrast, 
uniformly sampling 50 Chinese and 50 Vatican residents would 
be much more accurate with the same sampling budget. 

* This is an extended version of a paper with the same title 
presented at SIGMETRICS'11. This work was supported by 
SNF grant PBELP2-130871, 
GDI Award 1028394, USA. 

This type of problem has been widely studied in the 
statistical and survey sampling literature. A commonly used 
approach is stratified sampling [12, 28, 34], where nodes (e.g., 
people) are partitioned into a set of non-overlapping 
categories (or strata). The objective is then to decide how many 
independent draws to take from each category, so as to 
minimize the uncertainty of the resulting measurement. This 
effect can be achieved in expectation by a weighted 
independence sampler (WIS) with appropriately chosen sampling 
probabilities $\pi^{\mathrm{WIS}}$. In our example, WIS samples 
Vatican residents with much higher probabilities than Chinese 
ones, and completely avoids the rest of the world, as 
illustrated in Fig. 1(b). 

However, WIS, as every independence sampler, requires 
a sampling frame, i.e., a list of all elements we can sample 
from (e.g., a list of all Facebook users). This information is 
typically not available in today's online networks. A feasible 
alternative is crawling (also known as exploration or link- 
trace sampling). It is a graph sampling technique in which 
we can see the neighbors of already sampled users and make 
a decision on which users to visit next. 

In this paper, we study how to perform stratified sampling 
through graph crawling. We illustrate the key idea and some of 
the challenges in Fig. 1. Fig. 1(c) depicts a social network 
that connects the world's population. A simple random walk 
(RW) visits every node with frequency proportional to its 
degree, which is reflected by the node size. In this particular 
example, for simplicity of illustration, all nodes have the 
same degree, equal to 3. As a result, RW is equivalent to a 
uniform sample of the world's population, and faces exactly 
the same problem of wasting resources by sampling all nodes 
with the same probability. 

We address these problems by appropriately setting the 
edge weights and then performing a random walk on the 
weighted graph, which we refer to as weighted random walk 
(WRW). One goal in setting the weights is to mimic the 
WIS-optimal sampling probabilities $\pi^{\mathrm{WIS}}$ shown in 
Fig. 1(b). However, such a WRW might perform poorly due to 
potentially slow mixing. In our example, it will not even 
converge because the underlying weighted graph is 
disconnected, as shown in Fig. 1(d). Therefore, the edge weights 
under WRW (which determine the equilibrium distribution 
$\pi^{\mathrm{WRW}}$) should be chosen in a way that strikes a 
balance between the optimality of $\pi^{\mathrm{WIS}}$ and fast 
convergence. 

Figure 1: Illustrative example. Our goal is to compare the blue and black subpopulations (e.g., with respect 
to their median income) in population (a). Optimal independence sampler, WIS (b), over-samples the black 
nodes, under-samples the blue nodes, and completely skips the white nodes. A naive crawling approach, 
RW (c), samples many irrelevant white nodes. WRW that enforces WIS-optimal probabilities may result in 
poor or no convergence (d). S-WRW (e) strikes a balance between the optimality of WIS and fast convergence. 

We propose Stratified Weighted Random Walk (S-WRW), 
a practical heuristic that effectively strikes such a balance. 
We refer to our approach as "walking on the graph with a 
magnifying glass", because S-WRW over-samples more 
relevant parts of the graph and under-samples less relevant 
ones. In our example, S-WRW results in the graph presented in 
Fig. 1(e). The only information required by S-WRW is the 
category of each neighbor of every visited node, which is 
typically available in crawlable online networks, such as 
Facebook. S-WRW uses two natural and easy-to-interpret 
parameters, namely: (i) $f_\ominus$, which controls the fraction 
of samples from irrelevant categories, and (ii) $\gamma$, which 
is the maximal resolution of our magnifying glass, with respect 
to the largest relevant category. 

The main contributions of this paper are the following. 

• We propose to improve the efficiency of crawling-based 
graph sampling methods, by performing a stratified 
weighted random walk that takes into account not only 
the graph structure but also the node properties that 
are relevant to the measurement goal. 

• We design and evaluate S-WRW, a practical heuristic 
that sets the edge weights and operates with limited 
information. 

• As a case study, we apply S-WRW to sample Facebook 
and estimate the sizes of colleges. We show that S- 
WRW requires 13-15 times fewer samples than a simple 
random walk for the same estimation accuracy. 

The outline of the rest of the paper is as follows. Section 2 
summarizes the most popular graph sampling techniques, 
including sampling by exploration. Section 3 presents 
classical stratified sampling. Section 4 combines stratified 
sampling with graph exploration, presenting a unified WRW 
approach that takes into account both network structure and 
node properties; various trade-offs and practical issues are 
discussed and an efficient heuristic (S-WRW) is proposed 
based on the insights. Section 5 presents simulation results. 
Section 6 presents an implementation of S-WRW for the 
problem of estimating the college friendship graph on 
Facebook. Section 7 presents related work. Section 8 concludes 
the paper. 

2. SAMPLING TECHNIQUES 
2.1 Notation 

We consider an undirected, static¹ graph $G = (V, E)$, 
with $N = |V|$ nodes and $|E|$ edges. For a node $v \in V$, denote 
by $\deg(v)$ its degree, and by $\mathcal{N}(v) \subset V$ the list 
of neighbors of $v$. A graph $G$ can be weighted. We denote by 
$w(u, v)$ the weight of edge $\{u, v\} \in E$, and by 

$$w(u) = \sum_{v \in \mathcal{N}(u)} w(u, v) \tag{1}$$

the weight of node $u \in V$. For any set of nodes 
$A \subseteq V$, we define its volume $\mathrm{vol}(A)$ and weight 
$w(A)$, respectively, as 

$$\mathrm{vol}(A) = \sum_{v \in A} \deg(v) \quad\text{and}\quad w(A) = \sum_{v \in A} w(v). \tag{2}$$

We will often use 

$$f_A = \frac{|A|}{|V|} \quad\text{and}\quad f_A^{\mathrm{vol}} = \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)} \tag{3}$$

to denote the relative size of $A$ in terms of the number of 
nodes and the volumes, respectively. 

Sampling. We collect a sample $S \subset V$ of $n = |S|$ nodes. 
$S$ may contain multiple copies of the same node, i.e., the 
sampling is with replacement. In this section, we briefly 
review the techniques for sampling nodes from graph $G$. We 
also present the weighted random walk (WRW), which is the 
basic building block of our approach. 

2.2 Independence Sampling 

Uniform Independence Sampling (UIS) samples the 
nodes directly from the set $V$, with replacement, uniformly 
and independently at random, i.e., with probability 

$$\pi^{\mathrm{UIS}}(v) = \frac{1}{N} \quad\text{for every } v \in V. \tag{4}$$



Weighted Independence Sampling (WIS) is a weighted 
version of UIS. WIS samples the nodes directly from the 
set $V$, with replacement, independently at random, but 
with probabilities proportional to node weights $w(v)$: 

$$\pi^{\mathrm{WIS}}(v) = \frac{w(v)}{\sum_{u \in V} w(u)}. \tag{5}$$



In general, UIS and WIS are not possible in online networks 
because of the lack of a sampling frame. For example, the list 
of all user IDs may not be publicly available, or the user 
ID space may be too sparsely allocated. Nevertheless, we 
present them as baselines for comparison with the random 
walks. 
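
As a minimal sketch of the two independence samplers, assuming (hypothetically) that a complete sampling frame `nodes` and a dictionary `weight` of node weights are available, which is precisely what online networks usually do not offer:

```python
import random
from collections import Counter


def uis(nodes, n, rng):
    """Uniform independence sampling: n draws with replacement."""
    return [rng.choice(nodes) for _ in range(n)]


def wis(nodes, weight, n, rng):
    """Weighted independence sampling: P(v) proportional to weight[v]."""
    return rng.choices(nodes, weights=[weight[v] for v in nodes], k=n)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for repeatability
    nodes = ["a", "b", "c", "d"]  # toy sampling frame (illustrative)
    weight = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 7.0}
    print(Counter(uis(nodes, 1000, rng)))  # roughly equal counts
    print(Counter(wis(nodes, weight, 1000, rng)))  # "d" dominates
```

Both functions sample with replacement, matching the definition of the sample S in Section 2.1.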

2.3 Sampling via Crawling 

In contrast to independence sampling, the crawling tech- 
niques are possible in many online networks, and are there- 
fore the main focus of this paper. 

Simple Random Walk (RW) [23] selects the next-hop 
node $v$ uniformly at random among the neighbors of the 
current node $u$. In a connected and aperiodic graph, the 
probability of being at a particular node $v$ converges to 
the stationary distribution 

$$\pi^{\mathrm{RW}}(v) = \frac{\deg(v)}{2 \cdot |E|}. \tag{6}$$



Metropolis-Hastings Random Walk (MHRW) is an 
application of the Metropolis-Hastings algorithm [30] that 
modifies the transition probabilities to converge to a desired 
stationary distribution. For example, we can achieve the 
uniform stationary distribution 

$$\pi^{\mathrm{MH}}(v) = \frac{1}{N} \tag{7}$$

by randomly selecting a neighbor $v$ of the current node $u$ 
and moving there with probability 
$\min\left(1, \deg(u)/\deg(v)\right)$. However, it was shown in 
[17, 35] that RW (after re-weighting, as in Section 2.4) 
outperforms MHRW for most applications. We therefore restrict 
our attention to comparing against RW. 

Weighted Random Walk (WRW) is RW on a weighted 
graph [4]. At node $u$, WRW chooses the edge $\{u, v\}$ to follow 
with probability $P_{u,v}$ proportional to the weight 
$w(u, v) \ge 0$ of this edge, i.e., 

$$P_{u,v} = \frac{w(u, v)}{\sum_{v' \in \mathcal{N}(u)} w(u, v')}. \tag{8}$$

The stationary distribution of WRW is: 

$$\pi^{\mathrm{WRW}}(v) = \frac{w(v)}{\sum_{u \in V} w(u)}. \tag{9}$$



WRW is the basic building block of our design. In the next 
sections, we show how to choose weights for a specific esti- 
mation problem. 
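
The transition rule and stationary distribution of WRW can be sketched as follows. This is only an illustration: the adjacency format `adj[u][v] = w(u, v)` and both function names are assumptions made here, not part of the method's specification.

```python
import random


def weighted_random_walk(adj, start, n, rng):
    """Return n visited nodes; adj[u][v] is the symmetric weight w(u, v)."""
    walk, u = [], start
    for _ in range(n):
        nbrs = list(adj[u])
        # Move to neighbor v with probability w(u, v) / w(u).
        u = rng.choices(nbrs, weights=[adj[u][v] for v in nbrs], k=1)[0]
        walk.append(u)
    return walk


def stationary_distribution(adj):
    """Node weight w(v), normalized by the total weight of all nodes."""
    node_w = {u: sum(adj[u].values()) for u in adj}
    total = sum(node_w.values())
    return {u: node_w[u] / total for u in adj}


if __name__ == "__main__":
    # Toy weighted triangle (illustrative only).
    adj = {
        "a": {"b": 1.0, "c": 3.0},
        "b": {"a": 1.0, "c": 2.0},
        "c": {"a": 3.0, "b": 2.0},
    }
    walk = weighted_random_walk(adj, "a", 20000, random.Random(0))
    empirical = {u: walk.count(u) / len(walk) for u in adj}
    print(stationary_distribution(adj))
    print(empirical)
```

With all weights equal, the same code reduces to the simple RW, whose visit frequencies are proportional to node degrees.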

Graph Traversals (BFS, DFS, RDS, ...) are a family 
of crawling techniques where no node is sampled more than 
once. Because traversals introduce a generally unknown bias 
(see Sec. 7), we do not consider them in this paper. 

2.4 Correcting the bias 

RW, WRW, and WIS all produce biased (nonuniform) 
node samples. But their bias is known and can therefore be 
corrected by an appropriate re-weighting of the measured 
values. This can be done using the Hansen-Hurwitz 
estimator [19], as first shown in [39, 41] for random walks and 
also used in [35]. Let every node $v \in V$ carry a value $x(v)$. 
We can estimate the population total 
$x_{\mathrm{tot}} = \sum_{v \in V} x(v)$ by 

$$\hat{x}_{\mathrm{tot}} = \frac{1}{n} \sum_{v \in S} \frac{x(v)}{\pi(v)}, \tag{10}$$

where $\pi(v)$ is the sampling probability of node $v$ in the 
stationary distribution. In practice, we usually know $\pi(v)$, 
and thus $\hat{x}_{\mathrm{tot}}$, only up to a constant, i.e., we 
know the (non-normalized) weights $w(v)$. This problem 
disappears when we estimate the population mean 
$x_{\mathrm{av}} = \sum_{v \in V} x(v)/N$ as 

$$\hat{x}_{\mathrm{av}} = \frac{\sum_{v \in S} x(v)/\pi(v)}{\sum_{v \in S} 1/\pi(v)} = \frac{\sum_{v \in S} x(v)/w(v)}{\sum_{v \in S} 1/w(v)}. \tag{11}$$

For example, for $x(v) = 1$ if $\deg(v) = k$ (and $x(v) = 0$ 
otherwise), $\hat{x}_{\mathrm{av}}(k)$ estimates the node degree 
distribution in $G$. 

All the results in this paper are presented after this re- 
weighting step, whenever necessary. 
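
The re-weighting step can be sketched in a few lines. The function names and the dictionary-based inputs are illustrative assumptions; the estimator itself is the ratio form in which only the non-normalized weights w(v) are needed.

```python
def reweighted_mean(sample, x, w):
    """Estimate the population mean of x from a sample drawn with
    probabilities proportional to w (weights known up to a constant)."""
    num = sum(x[v] / w[v] for v in sample)
    den = sum(1.0 / w[v] for v in sample)
    return num / den


def degree_distribution(sample, deg):
    """Estimate P(deg = k) from an RW sample, where w(v) = deg(v)."""
    result = {}
    for k in sorted({deg[v] for v in sample}):
        indicator = {v: 1.0 if deg[v] == k else 0.0 for v in deg}
        result[k] = reweighted_mean(sample, indicator, deg)
    return result


if __name__ == "__main__":
    # Node "b" has three times the weight of "a", so an idealized
    # sample contains it three times as often.
    x = {"a": 10.0, "b": 20.0}
    w = {"a": 1.0, "b": 3.0}
    sample = ["a", "b", "b", "b"]
    print(sum(x[v] for v in sample) / len(sample))  # naive mean, biased
    print(reweighted_mean(sample, x, w))            # corrected mean
```

In this toy case the naive sample mean is pulled toward the over-sampled node, while the re-weighted estimate recovers the true mean of the two values.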

3. STRATIFIED SAMPLING 

In Sec. 1 we argued that in order to compare the median 
income of residents of China and the Vatican we should 
take 50 random samples from each of these two countries, 
rather than taking 100 UIS samples from China and the 
Vatican together (or, even worse, from the world's population). 
This problem naturally arises in the field of survey 
sampling. The most common solution is stratified sampling 
[12, 28, 34], where nodes $V$ are partitioned into a set 
$\mathcal{C}$ of non-overlapping node categories (or "strata"), 
with $\bigcup_{C \in \mathcal{C}} C = V$. Next, we select uniformly 
at random $n_i$ nodes from category $C_i$. We are free to choose 
the allocation $(n_1, n_2, \ldots, n_{|\mathcal{C}|})$, as long as 
we respect the total budget of samples $n = \sum_i n_i$. 

Under proportional allocation [28] (or 'prop') we use 
$n_i \propto |C_i|$, i.e., 

$$n_i^{\mathrm{prop}} = |C_i| \cdot n / N. \tag{12}$$



Another possibility is to do an optimal allocation (or 'opt') 
that minimizes the variance $\mathbb{V}$ of our estimator for the 
specific problem of interest. For example, assume that every 
node $v \in V$ carries a value $x(v)$, and we want to estimate 
the mean of $x$ in various scenarios, as discussed below. 

3.1 Examples of Stratified Sampling Problems 

3.1.1 Estimating the mean across the entire V 

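A classic application of stratification is to better estimate 
the population mean $\mu$, given several groups (strata) of 
different properties (e.g., variances). Given $n_i$ samples 
from category $C_i$, we can estimate the mean 
$\mu_i = \frac{1}{|C_i|} \sum_{v \in C_i} x(v)$ over category $C_i$ by 

$$\hat{\mu}_i = \frac{1}{n_i} \sum_{v \in S_i} x(v) \quad\text{with}\quad \mathbb{V}(\hat{\mu}_i) = \frac{\sigma_i^2}{n_i}, \tag{13}$$

where $S_i$ denotes the $n_i$ samples drawn from $C_i$, 
$\mathbb{V}(\hat{\mu}_i)$ is the variance of this estimator, and 
$\sigma_i^2$ is the variance of population $C_i$. We can estimate 
the population mean $\mu$ by a weighted average over all 
$\hat{\mu}_i$'s [28], i.e., 

$$\hat{\mu} = \sum_i \frac{|C_i|}{N} \cdot \hat{\mu}_i \quad\text{with}\quad \mathbb{V}(\hat{\mu}) = \sum_i \frac{|C_i|^2}{N^2} \cdot \frac{\sigma_i^2}{n_i}.$$

Under proportional allocation (Eq. (12)), this boils down to 
$\mathbb{V}^{\mathrm{prop}}(\hat{\mu}) = \frac{1}{n N} \sum_i |C_i| \sigma_i^2$. 
However, we can apply Lagrange multipliers to find that 
$\mathbb{V}(\hat{\mu})$ is minimized when 

$$n_i^{\mathrm{opt}} = \frac{|C_i| \, \sigma_i}{\sum_j |C_j| \, \sigma_j} \cdot n. \tag{14}$$

This solution is sometimes called 'Neyman allocation' [34]. 
This gives us the variance under optimal allocation 
$\mathbb{V}^{\mathrm{opt}}(\hat{\mu}) = \frac{1}{n N^2} \left( \sum_i |C_i| \sigma_i \right)^2$. 

The variances $\mathbb{V}^{\mathrm{prop}}(\hat{\mu})$ and 
$\mathbb{V}^{\mathrm{opt}}(\hat{\mu})$ are measures of the 
performance of proportional and optimal allocation, 
respectively. In order to make their practical interpretation 
easier, we also show how these variances translate into sample 
lengths. We define the gain $\alpha$ of 'opt' over 'prop' as the 
number of times 'prop' must be longer than 'opt' in order to 
achieve the same variance: 

$$\text{gain } \alpha = \frac{n^{\mathrm{prop}}}{n^{\mathrm{opt}}}, \quad\text{subject to}\quad \mathbb{V}^{\mathrm{prop}} = \mathbb{V}^{\mathrm{opt}}.$$

In that case, the gain is 

$$\alpha = N \cdot \frac{\sum_i |C_i| \sigma_i^2}{\left( \sum_i |C_i| \sigma_i \right)^2} \quad (\ge 1). \tag{15}$$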



Notice that this gain does not depend on the sample budget 
n. The gain is one of the main metrics we will use in the 
evaluation sections to assess the efficiency of our technique 
compared to the random walk. 
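
The allocations and the gain of this subsection can be checked numerically. The sketch below is illustrative only; the function names and the toy category sizes and standard deviations are assumptions, not values from our measurements.

```python
def proportional_allocation(sizes, n):
    """n_i proportional to the category size |C_i|."""
    total = sum(sizes)
    return [c * n / total for c in sizes]


def neyman_allocation(sizes, sigmas, n):
    """n_i proportional to |C_i| * sigma_i (optimal for the overall mean)."""
    total = sum(c * s for c, s in zip(sizes, sigmas))
    return [n * c * s / total for c, s in zip(sizes, sigmas)]


def variance_of_mean(sizes, sigmas, alloc):
    """Variance of the stratified estimator of the population mean."""
    total = sum(sizes)
    return sum((c / total) ** 2 * s ** 2 / n_i
               for c, s, n_i in zip(sizes, sigmas, alloc))


def gain_mean(sizes, sigmas):
    """Gain of 'opt' over 'prop'; independent of the budget n."""
    total = sum(sizes)
    num = sum(c * s ** 2 for c, s in zip(sizes, sigmas))
    den = sum(c * s for c, s in zip(sizes, sigmas)) ** 2
    return total * num / den


if __name__ == "__main__":
    sizes, sigmas, n = [100, 900], [10.0, 1.0], 100  # toy values
    prop = proportional_allocation(sizes, n)
    opt = neyman_allocation(sizes, sigmas, n)
    print(prop, opt)
    # For equal budgets, the variance ratio equals the gain.
    print(variance_of_mean(sizes, sigmas, prop)
          / variance_of_mean(sizes, sigmas, opt))
    print(gain_mean(sizes, sigmas))
```

Because both variances scale as 1/n, the ratio of variances at equal budget coincides with the ratio of sample lengths at equal variance, which is why a single function suffices for the gain.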

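3.1.2 Highest precision for all categories 

If we are equally interested in each category, we might 
want the same (highest possible) precision of estimating 
$\mu_i$ for all categories $C_i$. In this case, the metric to 
minimize is 
$\mathbb{V}_{\max} = \max_i \{ \mathbb{V}(\hat{\mu}_i) \} = \max_i \{ \sigma_i^2 / n_i \}$. 
Under proportional allocation, this translates to 
$\mathbb{V}_{\max}^{\mathrm{prop}} = \frac{N}{n} \max_i \{ \sigma_i^2 / |C_i| \}$. 
But the optimal $n_i$, which makes $\mathbb{V}(\hat{\mu}_i)$ equal 
for all $i$, is 

$$n_i^{\mathrm{opt}} = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \cdot n. \tag{16}$$

Consequently, 

$$\mathbb{V}_{\max}^{\mathrm{opt}} = \frac{1}{n} \sum_j \sigma_j^2,$$

which leads to gain 

$$\alpha = N \cdot \frac{\max_i \{ \sigma_i^2 / |C_i| \}}{\sum_j \sigma_j^2} \quad (\ge 1). \tag{17}$$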



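3.1.3 Smallest sum of variances across categories 

Even if we are interested in all categories, an alternative 
objective is to maximize the average precision of category 
pair comparisons (see Sec. 5A.13 in [12]), which is equivalent 
to minimizing the sum 
$\mathbb{V}_\Sigma = \sum_i \mathbb{V}(\hat{\mu}_i) = \sum_i \sigma_i^2 / n_i$. 
In this case, proportional allocation achieves 
$\mathbb{V}_\Sigma^{\mathrm{prop}} = \frac{N}{n} \sum_i \sigma_i^2 / |C_i|$, 
while, using Lagrange multipliers, we get 

$$n_i^{\mathrm{opt}} = \frac{\sigma_i}{\sum_j \sigma_j} \cdot n \quad\text{and}\quad \mathbb{V}_\Sigma^{\mathrm{opt}} = \frac{1}{n} \Big( \sum_i \sigma_i \Big)^2, \tag{18}$$

which leads to gain 

$$\alpha = N \cdot \frac{\sum_i \sigma_i^2 / |C_i|}{\left( \sum_i \sigma_i \right)^2} \quad (\ge 1). \tag{19}$$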
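
The two category-comparison objectives (equal precision, and smallest sum of variances) can be compared with a short sketch. It assumes that the per-category variance is sigma_i^2 / n_i, so that the optimal allocations are proportional to sigma_i^2 and to sigma_i, respectively; the function names and toy numbers are illustrative.

```python
def allocation_equal_precision(sigmas, n):
    """n_i proportional to sigma_i^2: equal variance in every category."""
    total = sum(s ** 2 for s in sigmas)
    return [n * s ** 2 / total for s in sigmas]


def allocation_min_sum(sigmas, n):
    """n_i proportional to sigma_i: smallest sum of category variances."""
    total = sum(sigmas)
    return [n * s / total for s in sigmas]


def gain_max(sizes, sigmas):
    """Gain over proportional allocation for the max-variance metric."""
    worst = max(s ** 2 / c for c, s in zip(sizes, sigmas))
    return sum(sizes) * worst / sum(s ** 2 for s in sigmas)


def gain_sum(sizes, sigmas):
    """Gain over proportional allocation for the sum-of-variances metric."""
    spread = sum(s ** 2 / c for c, s in zip(sizes, sigmas))
    return sum(sizes) * spread / sum(sigmas) ** 2


if __name__ == "__main__":
    sizes, sigmas = [100, 900], [1.0, 1.0]  # toy values
    print(allocation_equal_precision(sigmas, 100))
    print(allocation_min_sum(sigmas, 100))
    print(gain_max(sizes, sigmas), gain_sum(sizes, sigmas))
```

With equal variances, both rules split the budget evenly across categories regardless of their sizes, and the gain grows with the imbalance between category sizes.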



3.1.4 Relative sizes of node categories 

Stratified sampling assumes that we know the sizes $|C_i|$ of 
node categories. In some applications, however, these sizes 
are unknown and are among the values we need to estimate as 
well (e.g., by using UIS or WIS). We show in Appendix C 
(for $|\mathcal{C}| = 2$) that the optimal sample allocation and 
the corresponding gain $\alpha$ of WIS over UIS are, respectively, 

$$n_i^{\mathrm{WIS}} = \frac{1}{|\mathcal{C}|} \cdot n \quad\text{and}\quad \alpha = \frac{N^2}{4 \, |C_1| \cdot |C_2|}. \tag{20}$$



3.1.5 Irrelevant category $C_\ominus$ (aggregated) 

In many practical cases, we may want to measure some 
(but not all) node categories. E.g., in Fig. 1 we are 
interested in blue and black nodes, but not in white ones. 
Similarly, in our Facebook study in Section 6 we are only 
interested in self-declared college students, who account 
for only 3.5% of all users. We group all categories not 
covered by our measurement objective as a single irrelevant 
category $C_\ominus \in \mathcal{C}$, and we set 
$n_\ominus^{\mathrm{opt}} = 0$. In contrast, 
$n_\ominus^{\mathrm{prop}} = |C_\ominus| \cdot n / N$. As a result, 
under 'opt' we have $N / (N - |C_\ominus|)$ times more useful 
samples than under 'prop'. Now, if we optimally allocate all 
these useful samples between the relevant categories 
$\mathcal{C} \setminus \{C_\ominus\}$, the gain $\alpha$ becomes 

$$\alpha = \frac{N}{N - |C_\ominus|} \cdot \alpha(\mathcal{C} \setminus \{C_\ominus\}), \tag{21}$$

where $\alpha(\mathcal{C} \setminus \{C_\ominus\})$ is the gain of 
Eq. (15), (17), (19), or (20), depending on the metric, 
calculated only within categories 
$\mathcal{C} \setminus \{C_\ominus\}$. 

In other words, gain a is now composed of two factors: 
(i) gain in avoiding irrelevant categories, and (ii) gain in 
optimal allocation of samples among the relevant categories. 
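
A small worked computation of the first factor, as a sketch (the function name and the normalization of the population to 1000 units are assumptions made for illustration). With relevant nodes making up 3.5% of the population, as in the Facebook example above, avoiding the irrelevant category alone yields a gain of about 28.6.

```python
def gain_with_irrelevant(n_total, n_irrelevant, gain_relevant=1.0):
    """Two-factor gain: avoiding the irrelevant category, times the gain
    of optimal allocation among the relevant categories."""
    return n_total / (n_total - n_irrelevant) * gain_relevant


if __name__ == "__main__":
    # 3.5% relevant nodes: 35 out of every 1000.
    print(gain_with_irrelevant(1000, 965))
    # Same population, with an extra (hypothetical) factor of 2 from
    # optimal allocation among the relevant categories.
    print(gain_with_irrelevant(1000, 965, gain_relevant=2.0))
```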

3.1.6 Practical Guideline 

Let us look at the optimal weights in the above scenarios, 
when all $\sigma_i = \sigma$ are the same. This is a reasonable 
working assumption in many practical settings, since we 
typically do not have prior estimates of $\sigma_i$. With this 
simplification, Eq. (14) becomes 

$$n_i^{\mathrm{opt}} = \frac{|C_i|}{N} \cdot n = n_i^{\mathrm{prop}}.$$

In contrast, Eq. (16), Eq. (18) and Eq. (20) get simplified to 

$$n_i^{\mathrm{opt}} = \frac{1}{|\mathcal{C}|} \cdot n.$$

In conclusion, if we are interested in comparing the node 
categories with respect to some properties (e.g., average node 
degree, category size), rather than estimating a property 
across the entire population, we should take an equal 
number of samples from every relevant category. 

4. EDGE WEIGHT SETTING UNDER WRW 

In the previous section, we studied the optimal sample 
allocation under (independence) stratified sampling. How- 
ever, independence node sampling is typically impossible in 
large online graphs, while crawling the graph is a natural, 
available exploration primitive. In this section, we show how 
to perform a weighted random walk (WRW) which approx- 
imates the stratified sampling of the previous section. We 
can formulate the general problem as follows: 



Given a measurement objective, error metric and sampling 
budget \S\=n, set the edge weights in graph G such that the 
WRW measurement error is minimized. 

Although we are able to solve this problem analytically 
for some specific and fully known topologies, it is not obvi- 
ous how to address it in general, especially under a limited 
knowledge of G. Instead, in this paper, we propose S-WRW, 
a heuristic to set the edge weights. S-WRW starts from a 
solution optimal under WIS, and takes into account practical 
issues that arise in graph exploration. Once the weights 
are set, we simply perform WRW as described in Section 2.3 
and collect samples. 

4.1 Preliminaries 

4.1.1 Category-level granularity 

One can think of the problem in two levels of granularity: 
the original graph $G = (V, E)$ and the category graph 
$G^C = (\mathcal{C}, E^C)$. In $G^C$, nodes represent categories, 
and every undirected edge $\{C_1, C_2\} \in E^C$ represents the 
corresponding non-empty set of edges $E_{C_1,C_2} \subset E$ in 
the original graph $G$, i.e., 

$$E_{C_1,C_2} = \{ \{u, v\} \in E : u \in C_1 \text{ and } v \in C_2 \} \neq \emptyset.$$

In our approach, we move from the finer granularity of $G$ 
to the coarser granularity of $G^C$. This means that we are 
interested in collecting, say, $n_i$ samples from category 
$C_i$, but we do not control how these $n_i$ nodes are collected 
(i.e., with what individual sampling probabilities). 

The rationale for that simplification is twofold. From a 
theoretical point of view, categories are exactly the 
properties of interest in the estimation problems we consider. 
From a practical point of view, it is relatively easy to 
obtain or infer information about categories, as we show, e.g., 
in Sec.|12II] 

4.1.2 Stratification in expectation 

Ideally, we would like to enforce strictly stratified 
sampling. However, when we use crawling instead of 
independence sampling, sampling exactly $n_i$ nodes from 
category $C_i$ (and no other nodes) is possible only by 
discarding observations. It is thus more natural to frame the 
problem in terms of the probability mass placed on each 
category in equilibrium. This can be achieved by making the 
weight $w(C_i)$ of each category proportional to the desired 
number $n_i$ of samples, i.e., 

$$w(C_i) \propto n_i. \tag{22}$$

As a result, we draw $n_i$ samples from $C_i$ in expectation. 

4.1.3 Main guideline 

As the main guideline, S-WRW tries to realize the category 
weights $w^{\mathrm{WIS}}(C_i)$ that are optimal under WIS. There 
are many edge weight settings in $G$ that achieve 
$w^{\mathrm{WIS}}(C_i)$. In our implementation, we observe that 
$\mathrm{vol}(C_i)$ counts the number of edges incident on nodes of 
$C_i$. Consequently, if for every category $C_i$ we set in $G$ the 
weights of all edges incident on nodes in $C_i$ to 

$$w_e(C_i) = \frac{w^{\mathrm{WIS}}(C_i)}{\mathrm{vol}(C_i)}, \tag{23}$$

then the weights $w^{\mathrm{WIS}}(C_i)$ are achieved.² This simple 
observation is central to the S-WRW heuristic. 

In order to apply Eq. (23), we first have to calculate or 
estimate its terms $\mathrm{vol}(C_i)$ and $w^{\mathrm{WIS}}(C_i)$.³ 
Below, we show how to do it in Steps 1 and 2, respectively. 
Next, in Steps 3-5, we show how to modify these terms to 
account for practical problems arising mainly from the 
underlying graph structure. 

[Figure 2, flowchart contents:] 
Main guideline (to be modified): Set the edge weights in 
  category $C_i$ to $w^{\mathrm{WIS}}(C_i) / \mathrm{vol}(C_i)$. 
Step 1: Estimation of Category Volumes. Estimate 
  $\mathrm{vol}(C_i)$ with a pilot RW estimator 
  $\widehat{\mathrm{vol}}(C_i)$ as in Eq. (35). 
Step 2: Category Weights Optimal Under WIS. For a given 
  measurement objective, calculate $w^{\mathrm{WIS}}(C_i)$ as in 
  Sec. 3. 
Step 3: Include Irrelevant Categories. Modify 
  $w^{\mathrm{WIS}}(C_i)$. $f_\ominus$: desired fraction of 
  irrelevant nodes. 
Step 4: Tiny and Unknown Categories. Modify 
  $\widehat{\mathrm{vol}}(C_i)$. $\gamma$: maximal resolution. 
Step 5: Edge Conflict Resolution. Set the weights of 
  inter-category edges as in Eq. (28). 
WRW sample: Use transition probabilities proportional to 
  edge weights (Sec. 2.3). 
Correct for the bias: Apply formulas from Sec. 2.4. 
Final result. 

Figure 2: Overview of our approach. 

4.2 Our practical solution: S-WRW 

4.2.1 Step 1: Estimation of Category Volumes 

In general, we have no prior information about $G$ or $G^C$. 
Fortunately, it is easy and inexpensive to estimate the 
relative category volumes $f_{C_i}^{\mathrm{vol}}$, which is the 
first piece of information we need in Eq. (23) (see footnote 3). 
Indeed, it is enough to run a relatively short pilot RW, and 
plug the collected sample $S$ in Eq. (35) derived in Appendix B, 
as follows: 

$$\hat{f}_{C_i}^{\mathrm{vol}} = \frac{1}{n} \sum_{u \in S} \frac{1}{\deg(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C_i\}} \;\propto\; \widehat{\mathrm{vol}}(C_i).$$
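
A minimal sketch of the pilot estimator follows, under the assumption that it averages, over the RW sample, the fraction of each visited node's neighbors that fall in a given category. The names `neighbors`, `category` and `estimate_relative_volumes` are hypothetical.

```python
def estimate_relative_volumes(sample, neighbors, category):
    """Estimate f^vol_C for every category C seen among the neighbors
    of the nodes in an RW sample (list of visited nodes)."""
    n = len(sample)
    estimate = {}
    for u in sample:
        nbrs = neighbors[u]
        for v in nbrs:
            c = category[v]
            # Each neighbor contributes 1 / (deg(u) * n) to its category.
            estimate[c] = estimate.get(c, 0.0) + 1.0 / (len(nbrs) * n)
    return estimate


if __name__ == "__main__":
    # Path a - b - c; categories X = {a, b}, Y = {c}.
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    category = {"a": "X", "b": "X", "c": "Y"}
    # Idealized RW sample: every node appears deg(u) times.
    sample = ["a", "b", "b", "c"]
    print(estimate_relative_volumes(sample, neighbors, category))
```

Here vol(X) = 3 and vol(Y) = 1, so the idealized sample reproduces the relative volumes 0.75 and 0.25. Note that the estimator uses only the categories of neighbors, not their degrees.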



4.2.2 Step 2: Category Weights Optimal Under WIS 

In order to find the optimal WIS category weights 
$w^{\mathrm{WIS}}(C_i)$ in Eq. (23), we first calculate 
$n_i^{\mathrm{opt}}$ as shown, under various scenarios, in Sec. 3. 
Next, we plug the resulting $n_i^{\mathrm{opt}}$ in Eq. (22), e.g., 
by setting $w^{\mathrm{WIS}}(C_i) = n_i^{\mathrm{opt}}$. 

4.2.3 Step 3: Irrelevant Categories 

Problem: Potentially poor or no convergence. Consider 
the toy example in Fig. 3(a). We are interested in finding the 
relative sizes of red (dark) and green (light) categories. The 
white node in the middle is irrelevant for our measurement 
objective. Due to symmetry, we distinguish between two types 
of edges, with weights $w_1$ and $w_2$. Under WIS, Eq. (20) gives 
us the optimal weights $w_1 > 0$ and $w_2 = 0$, i.e., WIS samples 
every non-white node with the same probability and never 
samples the white one. However, under WRW with these 
weights, relevant nodes get disconnected into two components 
and WRW does not converge. We observed a similar problem in 
Fig. 1. 

[Figure 3, panel annotations:] 
(a) WIS: $w_1 > 0$, $w_2 = 0$. WRW: $w_1 = 0$, $w_2 > 0$. 
(b) WIS: $w_1 = 190\, w_2$. WRW: $w_1 \simeq 60\, w_2$ for $n = 50$; 
    $w_1 \simeq 100\, w_2$ for $n = 500$; $w_1 = 190\, w_2$ for 
    $n \to \infty$. 

Figure 3: Optimal edge weights: WIS vs WRW. The 
objective is to compare the sizes of red (dark) and 
green (light) categories. 

² There exist many other edge weight assignments that lead 
to $w^{\mathrm{WIS}}(C_i)$. Eq. (23) has the advantage of 
distributing the weights evenly across all $\mathrm{vol}(C_i)$ edges. 

³ In fact, we need to know $w_e(C_i)$ in Eq. (23) only up to 
a constant factor, because these factors cancel out in the 
calculation of transition probabilities of WRW in Eq. (8). 
Consequently, the same applies to $\mathrm{vol}(C_i)$ and 
$w^{\mathrm{WIS}}(C_i)$. 

Guideline: Occasionally visit irrelevant nodes. We 
show in Appendix D that the optimal WRW weights in 
Fig. 3(a) are $w_1 = 0$ and $w_2 > 0$. In that case, half of 
the samples are due to visits in the white (irrelevant) node. 
In other words, WRW may benefit from allocating a small 
weight $w(C_\ominus) > 0$ to category $C_\ominus$ that groups all 
(if any) categories irrelevant to our estimation. The intuition 
is that irrelevant nodes may not contribute to estimation but 
may be needed for connectivity or fast mixing. 

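Implementation in S-WRW. In S-WRW, we achieve this 
goal by replacing $w^{\mathrm{WIS}}(C_i)$ with 

$$\tilde{w}^{\mathrm{WIS}}(C_i) = \begin{cases} w^{\mathrm{WIS}}(C_i) & \text{if } C_i \neq C_\ominus \\ f_\ominus \cdot \sum_{C \neq C_\ominus} w^{\mathrm{WIS}}(C) & \text{if } C_i = C_\ominus. \end{cases} \tag{24}$$

The parameter $0 < f_\ominus \ll 1$ controls the desired fraction 
of visits in $C_\ominus$. 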
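
Step 3 can be sketched as below. The sketch assumes that the irrelevant category receives a weight equal to the chosen fraction times the total weight of the relevant categories, while relevant weights stay untouched; `include_irrelevant`, `w_wis` and `f_irr` are hypothetical names.

```python
def include_irrelevant(w_wis, irrelevant, f_irr):
    """Return category weights in which the irrelevant category gets a
    small positive weight and all relevant weights are unchanged."""
    relevant_total = sum(w for c, w in w_wis.items() if c != irrelevant)
    adjusted = {c: w for c, w in w_wis.items() if c != irrelevant}
    adjusted[irrelevant] = f_irr * relevant_total
    return adjusted


if __name__ == "__main__":
    # Two relevant categories with equal WIS-optimal weights.
    w_wis = {"red": 1.0, "green": 1.0, "irrelevant": 0.0}
    print(include_irrelevant(w_wis, "irrelevant", f_irr=0.01))
```

For small values of the fraction parameter, the share of the total category weight assigned to the irrelevant category is approximately equal to that parameter.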

4.2.4 Step 4: Tiny and Unknown Categories 

Problem: "black holes". Every optical system has a 
fundamental magnification limit due to diffraction and our 
"graph magnifying glass" is no exception. Consider the toy 
graph in Fig. 3(b): it consists of a big clique 
$C_{\mathrm{big}}$ of 20 red nodes with edge weights $w_2$, and a 
green category $C_{\mathrm{tiny}}$ with only two nodes and edge 
weights $w_1$. In Sec. 3.1.4 we saw that WIS optimally estimates 
the relative sizes of red and green categories for 
$w(C_{\mathrm{big}}) = w(C_{\mathrm{tiny}})$, i.e., for 
$w_1 = 190\, w_2$. However, for such large values of $w_1$, the two 
green nodes behave as a sink (or a "black hole") for a WRW of 
finite length, thus increasing the variance of the category 
size estimation. 

Guideline: limit edge weights. In other words, although 
WIS suggests over-sampling small categories, WRW 
should "under-over-sample" very small categories to avoid 
black holes. For example, in Fig. 3(b), $w_1 \simeq 60\, w_2$ 
($< 190\, w_2$) is optimal for WRW of length $n = 50$ 
(simulation results). 

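Implementation in S-WRW. In S-WRW, we achieve this 
goal by replacing $\mathrm{vol}(C_i)$ in Eq. (23) with 

$$\widetilde{\mathrm{vol}}(C) = \max\left\{ \widehat{\mathrm{vol}}(C),\ \mathrm{vol}_{\min} \right\}, \quad\text{where} \tag{25}$$

$$\mathrm{vol}_{\min} = \frac{1}{\gamma} \cdot \max_{C \neq C_\ominus} \left\{ \widehat{\mathrm{vol}}(C) \right\}. \tag{26}$$

Moreover, this formulation takes care of every category $C$ 
that was not discovered by the pilot RW in Sec. 4.2.1, by 
setting $\widetilde{\mathrm{vol}}(C) = \mathrm{vol}_{\min}$. 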
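
Step 4 amounts to clamping the estimated volumes from below. The sketch assumes that the floor is the largest estimated volume among relevant categories divided by the resolution parameter; `clamp_volumes`, `vol_hat` and the toy numbers are illustrative.

```python
def clamp_volumes(vol_hat, categories, gamma, irrelevant=None):
    """Raise every estimated volume to at least vol_min. Categories not
    discovered by the pilot walk are treated as having volume 0."""
    largest = max(v for c, v in vol_hat.items() if c != irrelevant)
    vol_min = largest / gamma
    return {c: max(vol_hat.get(c, 0.0), vol_min) for c in categories}


if __name__ == "__main__":
    vol_hat = {"big": 1000.0, "mid": 50.0, "tiny": 0.2, "irr": 5000.0}
    categories = ["big", "mid", "tiny", "unseen", "irr"]
    print(clamp_volumes(vol_hat, categories, gamma=1000.0, irrelevant="irr"))
```

Since the edge weight of a category is inversely proportional to its (clamped) volume, the floor caps how strongly tiny or undiscovered categories are magnified.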

4.2.5 Step 5: Edge Conflict Resolution 

Problem: Conflicting desired edge weights. With 
the above modifications, our target edge weights defined in 
Eq. (23) can be rewritten as 

$$w_e(C_i) = \frac{\tilde{w}^{\mathrm{WIS}}(C_i)}{\widetilde{\mathrm{vol}}(C_i)}. \tag{27}$$

We can directly set the weight 
$w(u, v) = w_e(C(u)) = w_e(C(v))$ for every intra-category edge 
$\{u, v\}$. However, for every inter-category edge, we usually 
have "conflicting" weights $w_e(C(u)) \neq w_e(C(v))$ desired at 
the two ends of the edge. 

Guideline: prefer inter-category edges. There are 
several possible edge weight assignments that achieve the 
desired category node weights. High weights on intra-category 
edges and small weights on inter-category edges result in 
WRW staying in small categories $C_{\mathrm{tiny}}$ for a long 
time. In order to improve the mixing time, we should do exactly 
the opposite, i.e., assign relatively high weights to 
inter-category edges (connecting relevant categories). As a 
result, WRW will enter $C_{\mathrm{tiny}}$ more often, but will stay 
there for a short time. This intuition is motivated by Monte 
Carlo variance reduction techniques such as the use of 
antithetic variates [15], which seek to induce negative 
correlation between consecutive draws so as to reduce the 
variance of the resulting estimator. 

Implementation in S-WRW. We choose to assign an edge 
weight $w_e$ that lies between the two values $w_e(C(u))$ 
and $w_e(C(v))$. We considered several such candidate 
assignments. We may take the arithmetic or geometric mean 
of the conflicting weights, which we denote by 
$w^{\mathrm{ar}}(u, v)$ and $w^{\mathrm{ge}}(u, v)$, respectively. We 
may also use the maximum of the two values, 
$w^{\max}(u, v)$, which should improve mixing according to the 
discussion above. However, $w^{\max}(u, v)$ alone would also 
add high weight to irrelevant nodes $C_\ominus$ (possibly far 
beyond $f_\ominus$). To avoid this undesired effect, we 
distinguish between the two cases by defining a hybrid 
solution: 

$$w^{\mathrm{hy}}(u, v) = \begin{cases} w^{\mathrm{ge}}(u, v) & \text{if } C_\ominus \in \{C(u), C(v)\} \\ w^{\max}(u, v) & \text{otherwise.} \end{cases} \tag{28}$$

This hybrid edge assignment was the one we found to work 
best in practice; see Section 6. 
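
A sketch of the hybrid conflict resolution follows. It assumes that the geometric mean is used whenever one endpoint is irrelevant and the maximum otherwise; that choice, the function name `hybrid_edge_weight`, and the dictionary `w_e` of per-category target edge weights are all assumptions made for illustration.

```python
import math


def hybrid_edge_weight(cat_u, cat_v, w_e, irrelevant):
    """Resolve the weight of edge {u, v} from the target edge weights of
    the categories at its two endpoints."""
    a, b = w_e[cat_u], w_e[cat_v]
    if cat_u == cat_v:
        return a  # intra-category edge: no conflict
    if irrelevant in (cat_u, cat_v):
        return math.sqrt(a * b)  # geometric mean (assumed here)
    return max(a, b)  # favors inter-category edges among relevant nodes


if __name__ == "__main__":
    w_e = {"A": 4.0, "B": 1.0, "irr": 0.25}  # toy target edge weights
    print(hybrid_edge_weight("A", "B", w_e, "irr"))
    print(hybrid_edge_weight("A", "irr", w_e, "irr"))
    print(hybrid_edge_weight("B", "B", w_e, "irr"))
```

Taking the maximum between relevant categories makes the walk cross category boundaries more often, while the milder rule near irrelevant nodes keeps their total weight from growing far beyond the intended fraction.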

4.3 Discussion 

4.3.1 Information needed about the neighbors 

In the pilot RW (Sec. 4.2.1) as well as in the main WRW, 
we assume that by sampling a node $v$ we also learn the 
category (but not the degree) of each of its neighbors 
$u \in \mathcal{N}(v)$. Fortunately, such information is often 
available in online graphs at no additional cost, especially 
when scraping HTML pages (as we do). For example, when 
sampling colleges in Facebook (Sec. 6), we use the college 
membership information of all of $v$'s neighbors, which, in 
Facebook, is available at $v$ together with the friends list. 

4.3.2 Cost of pilot RW 

The pilot RW volume estimator described in Sec. 4.2.1 
considers the categories not only of the sampled nodes, but 
also of their neighbors. As a result, it achieves high 
efficiency, as we show in simulations (Sec. 5.3.1) and Facebook 
measurements (Sec. 6.1). Given that, and the high robustness 
of S-WRW to estimation errors (see Sec. 5.3.5), the pilot RW 
should be only a small fraction of the later WRW (e.g., 6.5% 
in our Facebook measurements in Sec. 6). 

4.3.3 Setting the parameters 

S-WRW sets the edge weights trying to achieve roughly 
$w^{\mathrm{WIS}}(C_i)$ as the main goal. We slightly shape 
$w^{\mathrm{WIS}}(C_i)$ to avoid black holes and improve mixing, 
which is controlled by two natural and easy-to-interpret 
parameters, $f_\ominus$ and $\gamma$. 

Irrelevant node visits $f_\ominus$. The parameter 
$0 < f_\ominus \ll 1$ controls the desired fraction of visits in 
$C_\ominus$. When setting $f_\ominus$, we should exploit the 
information provided by the pilot crawl. If the relevant 
categories appear poorly interconnected and often separated by 
irrelevant nodes, we should set $f_\ominus$ relatively high. We 
have seen an extreme case in Fig. 3(a), with disconnected 
relevant categories and optimal $f_\ominus = 0.5$. In contrast, 
when the relevant categories are strongly interconnected, we 
should use a much smaller $f_\ominus$. However, because we can 
never be sure that the graph induced on relevant nodes is 
connected, we recommend always using $f_\ominus > 0$. For 
example, when measuring Facebook in Sec. 6 we set 
$f_\ominus = 1\%$. 

Maximal resolution $\gamma$. The parameter $\gamma \ge 1$ can 
be interpreted as the maximal resolution of our "graph 
magnifying glass", with respect to the largest relevant 
category $C_{\mathrm{big}}$. S-WRW will typically sample well all 
categories that are less than $\gamma$ times smaller than 
$C_{\mathrm{big}}$; all categories smaller than that are 
relatively undersampled (see Sec. 6.2.4). In the extreme case, 
for $\gamma \to \infty$, S-WRW tries to cover every category, no 
matter how small, which may cause the "black hole" problem 
discussed in Sec. 4.2.4. In the other extreme, for $\gamma = 1$ 
(and identical $w^{\mathrm{WIS}}(C_i)$ for all categories, 
including $C_\ominus$), S-WRW reduces to RW. We recommend 
always setting $1 < \gamma < \infty$. Ideally, we know 
$|C_{\mathrm{smallest}}|$, the smallest category size that is still 
relevant to us. In that case we should set 
$\gamma = |C_{\mathrm{big}}| / |C_{\mathrm{smallest}}|$.⁴ For example, 
in Sec. 6 the categories are US colleges; we set 
$\gamma = 1000$, because colleges with size smaller than 
1/1000th of the largest one (i.e., with a few tens of students) 
seem irrelevant to our measurement. As another rule of thumb, 
we should try to set a smaller $\gamma$ for relatively small 
sample sizes and in graphs with tight community structure 
(see Sec. 5.3.5). 

4.3.4 Conservative approach 

Note that a reasonable setting of these parameters (i.e.,
$\tilde{f}_\ominus > 0$ and $1 < \gamma < \infty$, and any conflict
resolution discussed in the paper) increases the weights of large
categories (including $C_\ominus$) and decreases the weight of small
categories, compared to $w^{WIS}(C_i)$. This makes S-WRW allocate
category weights between the two extremes: RW and WIS. Consequently,
S-WRW can be considered conservative (with respect to WIS).

4.3.5 S-WRW is unbiased 

It is also important to note that, because the collected WRW sample
is eventually corrected with the actual sampling weights as described
in Sec. 2.4, the S-WRW estimation process is unbiased, regardless of
the choice of weights (so long as convergence is attained). In
contrast, suboptimal weights (e.g., due to estimation error of the
category volumes) can increase the WRW mixing time and/or the variance
of the resulting estimator. However, our simulations and empirical
experiments on Facebook (see Sec. 5 and 6) show that S-WRW is very
robust to a suboptimal choice of weights.

5. SIMULATION RESULTS 

The gain of our approach compared to RW comes from two main factors.
First, S-WRW avoids, to a large extent or completely, the nodes in
$C_\ominus$ that are irrelevant to our measurement. This fact alone
can bring an arbitrarily large improvement ($\frac{N}{N - |C_\ominus|}$
under WIS), especially when $C_\ominus$ is large compared to N. We
demonstrate this in the Facebook measurements in Section 6. Second, we
can better allocate samples among the relevant categories. This factor
is observable in our Facebook measurements as well, but it is more
difficult to evaluate due to the lack of ground truth therein. In this
section, we evaluate the optimal allocation gain in a controlled
simulation and we demonstrate some key insights.

5.1 Setup 

We consider a graph G with 101K nodes and 505.5K edges organized in
two densely (and randomly) connected communities, as shown in
Fig. 4(h).

The nodes in G are partitioned into two node categories: $C_{tiny}$
with 1K nodes (dark red), and $C_{big}$ with 100K nodes (light
yellow). We consider two extreme scenarios of such a partition. The
'random' scenario is purely random, as shown in Fig. 4(a). In
contrast, under 'clustered', categories $C_{tiny}$ and $C_{big}$
coincide with the existing communities in G, as shown in Fig. 4(h).
It is arguably the worst-case scenario for graph sampling by
exploration.

We fix the edge weights of all internal edges in $C_{big}$ to 1. All
the remaining edges, i.e., all edges incident on nodes in category
$C_{tiny}$, have weight w each, where $w \ge 1$ is a parameter. Note
that this is equivalent to setting $w_e(C_{big}) = 1$,
$w_e(C_{tiny}) = w$, and 'max' or 'hybrid' conflict resolution.
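The edge-weight assignment of this setup can be sketched in Python as follows. This is a minimal illustration, not the simulator used for the results: the toy graph, the `category` map and the helper names `edge_weight` and `transition_probs` are hypothetical. It relies only on the standard definition of a weighted random walk, in which the next node is chosen among the neighbors with probability proportional to the edge weight.

```python
def edge_weight(u, v, category, w):
    """Weight w for any edge incident on a C_tiny node, 1 for edges internal to C_big."""
    if category[u] == "tiny" or category[v] == "tiny":
        return float(w)
    return 1.0


def transition_probs(u, adj, category, w):
    """One WRW step from u: move to neighbor v with probability proportional to w(u, v)."""
    weights = {v: edge_weight(u, v, category, w) for v in adj[u]}
    total = sum(weights.values())
    return {v: wt / total for v, wt in weights.items()}


# Hypothetical toy graph: node 0 is in C_tiny, nodes 1-3 are in C_big.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
category = {0: "tiny", 1: "big", 2: "big", 3: "big"}

# From node 1, the edge towards the tiny category is 4 times heavier.
probs = transition_probs(1, adj, category, w=4)
```

With w = 1 every edge has the same weight, and the walk reduces to the simple RW.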

5.2 Measurement objective and error metric 

We are mainly interested in measuring the relative sizes $f_{tiny}$
and $f_{big}$ of categories $C_{tiny}$ and $C_{big}$, respectively.

We use the Normalized Root Mean Square Error (NRMSE) to assess the
estimation error, defined as [37]:

$$\mathrm{NRMSE}(\hat{x}) = \frac{\sqrt{E[(\hat{x} - x)^2]}}{x}, \qquad (29)$$

where x is the real value and $\hat{x}$ is the estimated one.

^Strictly speaking, $\gamma$ is related to volumes $vol(C_i)$ rather
than sizes $|C_i|$. They are equivalent when category volume is
proportional to its size, which is often the case, and is the central
assumption in the "scale-up method" [9].

^The term "community" refers to a cluster and is defined purely based
on topology. The term "category" is a property of a node and is
independent of topology.

Figure 4: RW and S-WRW under two scenarios: Random (a-g) and Clustered
(h-n). In (b,i), we show the error of two volume estimators: naive
Eq. (32) (dotted) and neighbor-based Eq. (35) (plain). Next, we show
the error of the size estimator as a function of n (c,j) and w
(d,g,k,n); in the latter, UIS and RW correspond to WIS and S-WRW for
w = 1. In (e,l), we show the empirical probability that S-WRW visits
$C_{tiny}$ at least once. Finally, (f,m) is the gain $\alpha$ of
S-WRW over RW under the optimal choice of w (plain), and for fixed
$\gamma = w = 5$ (dashed).
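The error metric of Eq. (29) can be computed empirically as in the following Python sketch. It assumes that the expectation is replaced by an average over independent runs; the function name `nrmse` and the numbers are hypothetical and serve only as an illustration.

```python
import math


def nrmse(estimates, true_value):
    """Empirical NRMSE: sqrt(mean((x_hat - x)^2)) / x, following Eq. (29)."""
    if true_value == 0:
        raise ValueError("NRMSE is undefined for a true value of zero")
    mse = sum((x_hat - true_value) ** 2 for x_hat in estimates) / len(estimates)
    return math.sqrt(mse) / true_value


# Hypothetical estimates of a relative size whose true value is 0.01.
runs = [0.008, 0.012, 0.011, 0.009]
error = nrmse(runs, 0.01)
```

Because the error is normalized by the true value, errors for categories of very different sizes can be compared on the same scale.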



5.3 Results 

5.3.1 Estimating volumes is usually cheap 

The first step in S-WRW is obtaining category volume estimates
$\hat{f}_i^{vol}$. We achieve it by running a short pilot RW and
applying the estimator Eq. (35). We show
NRMSE($\hat{f}_{tiny}^{vol}$) as plain curves in Fig. 4(b). This
estimator takes advantage of the knowledge of the categories of the
neighboring nodes, which makes it much more efficient than the naive
estimator Eq. (32) shown by dashed curves. Moreover, the advantage of
Eq. (35) over Eq. (32) grows with the graph density and the skewness
of its degree distribution (not shown here).

Note that under 'random', RW and WIS (with the sampling probabilities
of RW) are almost equally efficient. However, on the other extreme,
i.e., under the 'clustered' scenario, the performance of RW becomes
much worse and the advantage of Eq. (35) over Eq. (32) diminishes.
This is because essentially all friends of a node from category $C_i$
are in $C_i$ too, which reduces Eq. (35) to Eq. (32). Nevertheless, we
show later in Sec. 5.3.5 that even severalfold volume estimation
errors are unlikely to affect the results significantly.

5.3.2 Visiting the tiny category 

Fig. 4(e,l) presents the empirical probability P[$C_{tiny}$ visited]
that our walk visits at least one node from $C_{tiny}$. Of course,
this probability grows with the sample length. However, the choice of
weight w also helps. Indeed, WRW with w > 1 is more likely to visit
$C_{tiny}$ than RW (w = 1, bottom line). This demonstrates the first
advantage of introducing edge weights and WRW.

5.3.3 Optimal w and 7 

Let us now focus on the estimation error as a function of w, shown in
Fig. 4(d,k). Interestingly, this error does not drop monotonically
with w but follows a 'U' shaped function with a clear optimal value
$w^{opt}$.

Under WIS, we have $w^{opt} \approx 100$, which confirms our findings
in Sec. 3.1.4. Indeed, according to Eq. (20), we need the same number
of samples from the two categories, and thus
$w^{WIS}(C_{tiny}) = w^{WIS}(C_{big})$ (by Eq. (?)). By plugging this
and $vol(C_{big}) = 100 \cdot vol(C_{tiny})$ into Eq. (?), we finally
obtain the WIS-optimal edge weights in $C_{tiny}$, i.e.,

$$w^{opt} = w_e(C_{tiny}) = 100 \cdot w_e(C_{big}) = 100.$$

In contrast, WRW is optimized for w < 100. For the sample length
n = 500 as in Fig. 4(d,k), the error is minimized already for
$w^{opt} \approx 20$ and increases for higher weights. This
demonstrates the "black hole" effect discussed in Sec. 4.2.4. It is
much more pronounced in the 'clustered' scenario, confirming our
intuition that black holes become a problem only in the presence of
relatively isolated, tight communities. Of course, the black hole
effect diminishes with the sample length n (and completely vanishes
for $n \to \infty$), which can be observed in Fig. 4(g,n), especially
in (n).

In other words, the optimal assignment of edge weights (in relevant
categories) under WRW lies somewhere between RW (all weights equal)
and WIS. In S-WRW, we control it by the parameter $\gamma$. In this
example, we have $\gamma = w$ for $\gamma < 100$. Indeed, by
combining Eq. (?), Eq. (?), Eq. (?), and
$w^{WIS}(C_{tiny}) = w^{WIS}(C_{big})$, we obtain

$$w = \frac{w}{1} = \frac{w_e(C_{tiny})}{w_e(C_{big})}
    = \frac{w^{WIS}(C_{tiny}) / \widetilde{vol}(C_{tiny})}{w^{WIS}(C_{big}) / \widetilde{vol}(C_{big})}
    = \frac{\widetilde{vol}(C_{big})}{\widetilde{vol}(C_{tiny})}
    = \frac{\widehat{vol}(C_{big})}{\frac{1}{\gamma} \widehat{vol}(C_{big})} = \gamma.$$

Consequently, the optimal setting of $\gamma$ is the same as $w^{opt}$
discussed above.

^For simplicity, we ignored in this calculation the conflicts on the
500 edges between $C_{big}$ and $C_{tiny}$.

5.3.4 Gain a 

The gain $\alpha$ of WIS over UIS is given by Eq. (40). In this case,
we have $\alpha = (101K)^2 \cdot (4 \cdot 1K \cdot 100K)^{-1} \approx 25$.
Indeed, WIS with n = 500 samples shown in Fig. 4(d) achieves
NRMSE $\approx$ 0.1, which is the same as UIS with about $\alpha = 25$
times more samples (see Fig. 4(c)).

This gain due to stratification is smaller for sampling by
exploration: a 500-hop-long WRW with $w \approx 20$ yields the same
error NRMSE $\approx$ 0.3 as a 2000-hop-long RW. This means that WRW
reduces the sampling cost by a factor of $\alpha \approx 4$. Fig. 4(f)
shows that this gain does not vary much with the sampling length.
Under 'clustered', both RW and WRW perform much worse. Nevertheless,
Fig. 4(m) shows that also in this scenario WRW may significantly
reduce the sampling cost, especially for longer samples.

It is worth noting that WRW can sometimes significantly outperform
UIS. This is the case in Fig. 4(d), where UIS is equivalent to WIS
with w = 1. Because no walk can mix faster than UIS (which is
independent and thus has perfect mixing), improving the mixing time
alone [5, 10, 37, 38] cannot achieve the potential gains of
stratification, in general.

So far we focused on the smaller set $C_{tiny}$ only. When estimating
the size of $C_{big}$, all errors are much smaller, but we observe a
similar gain $\alpha$.

5.3.5 Robustness to 7 and volume estimation 

The gain $\alpha$ shown above is calculated for the optimal choice of
w, or, equivalently, $\gamma$. Of course, in practice it might be
impossible to obtain this value. Fortunately, S-WRW is relatively
robust to the choice of parameters. The dashed lines in Fig. 4(f,m)
are calculated for $\gamma$ fixed to $\gamma = 5$, rather than
optimized. Note that this value is often drastically smaller than the
optimal one (e.g., $w^{opt} \approx 50$ for n = 5000). Nevertheless,
although the performance drops somewhat, S-WRW still reduces the
sampling cost about three-fold.

This observation also addresses potential concerns one might have
regarding the category volume estimation error (see Sec. 4.2.1).
Indeed, setting $\gamma = 5$ means that every category $C_i$ with
volume estimated at $\widehat{vol}(C_i) < \frac{1}{\gamma} \widehat{vol}(C_{big})$
is treated the same. In Fig. 4(f), the volume of $C_{tiny}$ would have
to be overestimated by more than 20 times in order to affect the edge
weight setting and thus the results. We have seen in Sec. 5.3.1 that
this is very unlikely, even under the smallest sample lengths and most
adversarial scenarios.
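The robustness argument above can be illustrated with a short Python sketch. It assumes, as in the example of Sec. 5.3.3, that all relevant categories have the same target weight under WIS, so that the edge weight of a category is inversely proportional to its estimated volume after flooring that volume at $\frac{1}{\gamma}$ of the largest one. The function name `edge_weights` and the input volumes are hypothetical.

```python
def edge_weights(vol_est, gamma):
    """Relative edge weights per category, normalized so the largest category gets 1.

    Volumes below vol_big / gamma are raised to that floor, so all such
    categories are treated the same (assumes equal WIS weights per category).
    """
    vol_big = max(vol_est.values())
    floor = vol_big / gamma
    return {c: vol_big / max(v, floor) for c, v in vol_est.items()}


true_vol = {"big": 100.0, "tiny": 1.0}
over_10x = {"big": 100.0, "tiny": 10.0}   # C_tiny overestimated 10 times
over_25x = {"big": 100.0, "tiny": 25.0}   # C_tiny overestimated 25 times

w_true = edge_weights(true_vol, gamma=5)
w_10x = edge_weights(over_10x, gamma=5)
w_25x = edge_weights(over_25x, gamma=5)
```

In this sketch a 10-fold overestimate of the tiny category leaves the weights unchanged, whereas a 25-fold overestimate (beyond the 20-fold threshold) lowers its edge weight from 5 to 4.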

5.4 Summary 

WRW brings two types of benefits: (i) avoiding irrelevant nodes
$C_\ominus$ and (ii) carefully allocating samples between relevant
categories of different sizes. Even when $C_\ominus = \emptyset$, WRW
can still reduce the sampling cost by 75%. This second benefit is more
difficult to achieve when the categories form strong and tight
communities, which leads to the "black hole" effect. We should then
choose smaller, more conservative values of $\gamma$ in S-WRW, which
translate into smaller w in our example. In contrast, under a looser
community structure this problem disappears and WRW is closer to WIS.

6. IMPLEMENTATION IN FACEBOOK 

As a concrete application, we apply S-WRW to measure the Facebook
social graph, which is our motivating and canonical example. We also
note that it is undirected and can be considered a static graph, for
all practical purposes in this study. In Facebook, every user may
declare herself a member of a college she attends. This membership
information is publicly available by default and allows us to answer
some interesting questions. For example, how do the college networks
(or "colleges" for short) compare with respect to their sizes? What is
the college-to-college friendship graph? In order to answer these
questions, we have to collect many college user samples, preferably
evenly distributed between colleges. This is the main goal of this
section.

6.1 Measurement Setup 

By default, every Facebook user can see the basic information on any
other user, including the name, photo, and a list of friends together
with their college memberships (if any). We developed a
high-performance multi-threaded crawler to explore Facebook's social
graph by scraping this web interface.

To make an informed decision about the parameters of S-WRW, we first
ran a short pilot RW (see Sec. 4.2.1) with a total of 65K samples
(which is only 6.5% of the length of the main S-WRW sample). Although
our pilot walk visited only 2000 colleges, it estimated the relative
volumes $\hat{f}_i^{vol}$ for about 9500 colleges discovered among
friends of sampled users, as discussed in Sec. 4.3.2. In Fig. 6(a), we
show that the neighbor-based estimator Eq. (35) greatly outperforms
the naive estimator Eq. (32). These volumes cover several decades.
Because colleges with only a few tens of users are not of interest to
us, we set the maximal resolution to $\gamma = 1000$ (see the
discussion in Sec. 4.3.3). Finally, because the college students
looked very well interconnected in our pilot RW, we set the desired
fraction of irrelevant nodes to a small number,
$\tilde{f}_\ominus = 1\%$.

In the main measurement phase, we collected three S-WRW crawls, each
with a different edge weight conflict resolution (hybrid, geometric,
and arithmetic), and one simple RW crawl as a baseline comparison
(Table 1). For each crawl type we collected 1 million unique users.
Some of them are sampled multiple times (at no additional bandwidth
cost), which results in a higher total number of samples in the second
row of Table 1. Our crawls were performed on Oct. 16-19, 2010, and are
available at [1].

^The Facebook characteristics do change, but on time scales much
longer than the 3-day duration of our crawls. Websites such as
Facebook statistics, Alexa, etc., show that the number of Facebook
users is growing at a rate of 0.1-0.2% per day.

^There also exist categories other than colleges, namely "work" and
"high school". Facebook requires a valid category-specific email for
verification.

Figure 5: 5331 colleges discovered and ranked by RW. (a) Estimated
relative college sizes. (b) Absolute number of user samples per
college. (c-e) 25 estimates of size $\hat{f}_i$ for three different
colleges and sample lengths n. (f) Average NRMSE of college size
estimation. Results in (a,b,f) are binned.

                    RW        S-WRW
                              Hybrid    Geometric   Arithmetic
Unique samples      1,000K    1,000K    1,000K      1,000K
Total samples       1,016K    1,263K    1,228K      1,237K
College samples     9%        86%       79%         58%
Unique colleges     5,331     9,014     8,994       10,439

Table 1: Overview of collected Facebook datasets.

6.2 Results: RW vs. S-WRW

6.2.1 Avoiding irrelevant categories

Only 9% of the RW's samples come from colleges, which means that the
vast majority of sampling effort is wasted. In contrast, the S-WRW
crawls achieved 6-10 times better efficiency, collecting 86% (hybrid),
79% (geometric) and 58% (arithmetic) of their samples from colleges.
Note that these values are significantly lower than the target 99%
suggested by our choice of $\tilde{f}_\ominus = 1\%$, and that S-WRW
hybrid reaches the highest number. This is in agreement with our
discussion in Sec. 4.2.5. Finally, we also note that S-WRW crawls
discovered 1.6-1.9 times more unique colleges than RW.

It might seem surprising that RW samples colleges in 9% of cases while
only 3.5% of Facebook users belong to colleges. This can be explained
by looking at the last rows of Table 1. Indeed, the college users have
on average three times more Facebook friends than average users, and
therefore they attract RW approximately three times more often.

6.2.2 Stratification

The advantage of S-WRW over RW does not lie exclusively in avoiding
the nodes in the irrelevant category $C_\ominus$. S-WRW can also
over-sample small categories (here colleges) at the cost of
under-sampling large ones (which are very well sampled anyway). This
feature becomes important especially when the category sizes differ
significantly, which is the case in Facebook. Indeed, Fig. 5(a) shows
that college sizes exhibit great heterogeneity. For a fair comparison,
we only include the 5,331 colleges discovered by RW. (In fact, this
filtering gives preference to RW. S-WRW crawls discovered many more
colleges that we do not show in this figure.) They span more than two
orders of magnitude and follow a heavily skewed distribution (not
shown here).

Fig. 5(b) confirms that S-WRW successfully oversamples the small
colleges. Indeed, the number of S-WRW samples per college is almost
constant (roughly around 100). In contrast, the number of RW samples
follows closely the college size, which results in dramatic 100-fold
differences between RW and S-WRW for smaller colleges.


6.2.3 College size estimation 

With more samples per college, we naturally expect a better estimation
accuracy under S-WRW. We demonstrate it for three colleges of
different sizes (in terms of the number of Facebook users): MIT
(large), Caltech (medium), and Eindhoven University of Technology
(small). Each boxplot in Fig. 5(c-e) is generated based on 25
independent college size estimates $\hat{f}_i$ that come from walks of
length n = 4K (left), 20K (middle), and 40K (right) samples each. For
the three studied colleges, RW fails to produce reliable estimates in
all cases except for MIT (largest college) under the two longest
crawls. Similar results hold for the overwhelming majority of
middle-sized and small colleges. The underlying reason is the very
small number of samples collected by RW in these colleges, averaging
below 1 sample per walk. In contrast, the three S-WRW crawls typically
contain 5-50 times more samples than RW (in agreement with Fig. 5(b)),
and produce much more reliable estimates.

Finally, we aggregate the results over all colleges and compute the
gain $\alpha$ of S-WRW over RW. We calculate the error
NRMSE($\hat{f}_i$) by taking as our "ground truth" $f_i$ the grand
average of $\hat{f}_i$ values over all samples collected via all
full-length walks and crawl types. Fig. 5(f) presents
NRMSE($\hat{f}_i$) averaged over all 5,331 colleges discovered by RW,
as a function of walk length n. As expected, for all crawl types the
error decreases with n. However, there is a consistently large gap
between RW and all three versions of S-WRW. RW needs 13-15 times more
samples than S-WRW in order to achieve the same error.

6.2.4 The effect of the choice of $\gamma$

Recall that in all the S-WRW results described above, we used the
resolution $\gamma = 1000$. In order to check how sensitive the
results are to the choice of this parameter, we also tried a (shorter)
S-WRW run with $\gamma = 100$, i.e., ten times smaller. In Fig. 6(b),
we see that the number of samples collected in the smallest colleges
is smaller under $\gamma = 100$ than under $\gamma = 1000$. In fact,
the two curves diverge for colleges about 100 times smaller than the
biggest college, i.e., exactly at the maximal resolution
$\gamma = 100$.

In any case, both settings of $\gamma$ perform orders of magnitude
better than RW of the same length.

Figure 6: Facebook: Pilot RW and other walks of the same length
n = 65K. (a) The performance of the neighbor-based volume estimator
Eq. (35) (plain line) and the naive one Eq. (32) (dashed line). As
'ground truth' we used the volumes calculated from all 4x1M collected
samples. (b) The effect of the choice of $\gamma$.

6.3 Summary 

Only about 3.5% of 500M Facebook users are college members. There are
more than 10K colleges and they vary greatly in size, ranging from 50
(or fewer) to 50K members (we aggregate students, alumni and staff).
In this setting, state-of-the-art sampling methods such as RW are
bound to perform poorly. Indeed, UIS, i.e., an idealized version of
RW, with as many as 1M samples will collect only one sample from a
size-500 college, on average. Even if we could magically sample
directly only from colleges, we would typically collect fewer than 30
samples per size-500 college.
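The two back-of-the-envelope numbers above follow from the expected number of hits under uniform independence sampling. A minimal Python check, using only the figures quoted in the text (the helper name `expected_samples` is hypothetical):

```python
def expected_samples(n, category_size, population_size):
    """Expected number of hits in one category under uniform independence sampling."""
    return n * category_size / population_size


N_USERS = 500e6           # Facebook users (from the text)
COLLEGE_FRACTION = 0.035  # fraction of users that are college members (from the text)
n = 1e6                   # sample length

# UIS over all users vs. an idealized sampler restricted to college members.
per_college_uis = expected_samples(n, 500, N_USERS)
per_college_ideal = expected_samples(n, 500, COLLEGE_FRACTION * N_USERS)
```

The first value is 1 sample per size-500 college, and the second is just under 30.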

S-WRW solves these problems. We showed that S-WRW of the same length
typically collects about 100 samples per size-500 college. As a
result, S-WRW outperforms RW by $\alpha$ = 13-15 times, or
$\alpha$ = 12-14 times if we also consider the 6.5% overhead from the
initial pilot RW. This huge gain can be decomposed into two factors,
say $\alpha = \alpha_1 \cdot \alpha_2$, as we proposed in Eq. (21).
Factor $\alpha_1 \approx 8$ can be attributed to an about 8 times
higher fraction of college samples in S-WRW compared to RW. Factor
$\alpha_2 \approx 1.5$ is due to over-sampling smaller networks, i.e.,
to applying stratified sampling.

Another important observation is that S-WRW is robust to the way we
resolve target edge weight conflicts in Sec. 4.2.5. The differences
between the three S-WRW implementations are minor; it is the
application of Eq. (27) that brings most of the benefit.

7. RELATED WORK 

Graph Sampling by Exploration. Early crawling of P2P, OSN and WWW
typically used graph traversals, mainly BFS [5I5Tti33l43| and its
variants. However, incomplete BFS introduces bias towards high-degree
nodes that is unknown and thus impossible to correct in general graphs
[2, 8, 17, 25, 26]. Later studies followed a more principled approach
based on random walks (RW) [4, 29]. The Metropolis-Hastings RW (MHRW)
[16, 30] removes the bias during the walk; it has been used to sample
P2P networks [35, 40] and OSNs [17]. Alternatively, we can use RW,
whose bias is known and can be corrected for [20, 39], thus leading to
a re-weighted RW [17, 35]. RW was also used to sample the Web [21],
P2P networks [18, 35, 40], OSNs [17, 24, 33, 36] and other large
graphs [27]. It was empirically shown in [17, 35] that RW outperforms
MHRW in measurement accuracy. Therefore, RW can be considered as the
state of the art.

Random walks have also been used to sample dynamic graphs
[35, 40, 42], which are outside the scope of this paper.

Fast Mixing Markov Chains. The mixing time of a random walk determines
the efficiency of the sampling. On the practical side, the mixing time
of RW in many OSNs was found larger than commonly believed [33].
Multiple dependent random walks [37] have been used to sample
disconnected and loosely connected graphs. Random walks with jumps
have been used to sample large graphs in [5, 38] and in [27]. All the
above methods treat all nodes with equal importance, which is
orthogonal to our technique.

On the theoretical side, in [10], the authors propose a method to set
edge weights that achieve the fastest mixing WRW for a given target
stationary distribution. This technique, although related, is not
applicable in our context. First, [10] requires the knowledge of the
graph, which makes it inapplicable to G, yet possibly feasible in
$G^C$ (after estimating some limited information about $G^C$ as in
Sec. 4.2.1). In the latter case, however, even given a perfect
knowledge of $G^C$, [10] often assigns weight to some self-loops,
which likely makes the underlying graph G disconnected. Finally, and
most importantly, [10] takes a target stationary distribution as
input. By taking $w^{WIS}$, we will face exactly the same problems of
potentially poor convergence (Sec. 4.2.3) and "black holes"
(Sec. 4.2.4) as we addressed by S-WRW.

Stratified Sampling. Our approach builds on stratified sampling [34],
a widely used technique in statistics; see [12, 28] for a good
introduction.

A related work in a different networking problem is [14], where
threshold sampling is used to vary sampling probabilities of network
traffic flows and estimate their volume.

Weighted Random Walks for Sampling. Random walks on graphs with
weighted edges, or equivalently reversible Markov chains [4, 29], are
well studied and heavily used in Monte Carlo Markov Chain simulations
[T^ to sample a state space with a specified probability distribution.
However, to the best of our knowledge, WRWs have not been designed
explicitly for measurements of real online systems. In the context of
sampling OSNs, the closest works are [5, 38]. Technically speaking,
they use WRW. But they set as their only objective the minimization of
the mixing time, which makes them orthogonal and complementary to our
approach, as we discussed above.

Very recent applications of weighted random walks in online social
networks include [6, 7]. [7] uses WRW in the context of link
prediction. The authors employ supervised learning techniques to set
the edge weights, with the goal of increasing the probability of
visiting nodes that are more likely to receive new links. [6]
introduces WRW-based methods to generate samples of nodes that are
internally well-connected but also approximately uniform over the
population. In both these papers, WRW is used to predict/extract
something from a known graph. In contrast, we use WRW to estimate
features of an unknown graph.

In the context of World Wide Web crawling, focused crawling techniques
[11, 13] have been introduced to follow web pages of specified
interest and to avoid the irrelevant pages. This is achieved by
performing a BFS type of sample, except that instead of a FIFO queue
they use a priority queue weighted by the page relevancy. In our
context, such an approach suffers from the same problems as regular
BFS: (i) collected samples strongly depend on the starting point, and
(ii) we are not able to unbias the sample.

8. CONCLUSION 

We introduced Stratified Weighted Random Walk (S-WRW), an efficient
way to sample large, static, undirected graphs via crawling and using
minimal information. S-WRW performs a weighted random walk on the
graph with weights determined by the estimation problem. We apply our
approach to measure the Facebook social graph, and we show that S-WRW
greatly outperforms the state-of-the-art sampling technique, namely
the simple re-weighted random walk.

There are several directions for future work. First, S- 
WRW is currently an intuitive and efficient heuristic; in fu- 
ture work, we plan to investigate the optimal solution to 
problems identified in this paper and compare against or im- 
prove S-WRW. Second, it may be possible to combine these 
ideas with existing orthogonal techniques, some of which 
have been reviewed in Related Work, to further improve 
performance. Finally, we are interested in extending our
techniques to dynamic graphs and non-stratified data.

Appendix A: Achieving Arbitrary Node Weights

Achieving arbitrary node weights by setting the edge weights in a
graph $G = (V, E)$ is sometimes impossible. For example, for a graph
that is a path consisting of two nodes ($v_1 - v_2$), it is impossible
to achieve $w(v_1) \ne w(v_2)$. However, it is always possible to do
so if there is a self-loop at each node.

Observation 1. For any undirected graph $G = (V, E)$ with a self-loop
$\{v, v\}$ at every node $v \in V$, we can achieve an arbitrary
distribution of node weights $w(v) > 0$, $v \in V$, by an appropriate
choice of edge weights $w(u, v) > 0$, $\{u, v\} \in E$.

Proof. Denote by $w_{min}$ the smallest of all target node weights
$w(v)$. Set $w(u, v) = w_{min}/N$ for all non-self-loop edges (i.e.,
where $u \ne v$). Now, for every self-loop $\{v, v\} \in E$ set

$$w(v, v) = w(v) - \frac{w_{min}}{N} \cdot (\deg(v) - 2).$$

It is easy to check that, because there are exactly $\deg(v) - 2$
non-self-loop edges incident on v, every node $v \in V$ will achieve
the target weight $w(v)$. Moreover, the definition of $w_{min}$
guarantees that $w(v, v) > 0$ for every $v \in V$. □
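The construction in the proof can be checked with a small Python sketch. It assumes that the weight of a node is the sum of the weights of its incident edges with the self-loop counted once, and that a self-loop contributes 2 to the degree (so that a node has deg(v) - 2 non-self-loop edges, as in the proof). The function names and the example graph are hypothetical.

```python
def assign_edge_weights(neighbors, target):
    """Edge weights realizing target node weights on a graph with a self-loop at every node.

    neighbors[v] lists the non-self-loop neighbors of v, so deg(v) = len(neighbors[v]) + 2.
    """
    n_nodes = len(neighbors)
    base = min(target.values()) / n_nodes          # weight of every non-self-loop edge
    weights = {}
    for v, nbrs in neighbors.items():
        for u in nbrs:
            weights[frozenset((u, v))] = base
        # Self-loop {v, v}: the target weight minus what the other edges contribute.
        weights[frozenset((v,))] = target[v] - base * len(nbrs)
    return weights


def node_weight(v, neighbors, weights):
    """Sum of the weights of all edges incident on v (self-loop counted once)."""
    return weights[frozenset((v,))] + sum(weights[frozenset((u, v))] for u in neighbors[v])


# The two-node path from the text, with self-loops added and unequal targets.
neighbors = {"v1": ["v2"], "v2": ["v1"]}
target = {"v1": 1.0, "v2": 3.0}
weights = assign_edge_weights(neighbors, target)
```

In this example the edge between the two nodes gets weight 0.5 and the self-loops get 0.5 and 2.5, which yields the unequal node weights that are impossible without self-loops.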



Appendix B: Estimating Category Volumes 

In this section, we derive efficient estimators of the volume ratio
$f_C^{vol} = \frac{vol(C)}{vol(V)}$. Recall that $S \subset V$ denotes
an independent sample of nodes in G, with replacement.
Node sampling 

If S is a uniform sample UIS, then we can write

$$\hat{f}_C^{vol} = \frac{\sum_{v \in S} \deg(v) \cdot 1_{\{v \in C\}}}{\sum_{v \in S} \deg(v)}, \qquad (30)$$

which is a straightforward application of the classic ratio estimator
[28].

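In the more general case, when S is selected using WIS, we have to
correct for the linear bias towards nodes of higher weights $w()$, as
follows:

$$\hat{f}_C^{vol} = \frac{\sum_{v \in S} \deg(v) \cdot 1_{\{v \in C\}} / w(v)}{\sum_{v \in S} \deg(v) / w(v)}. \qquad (31)$$

In particular, if $w(v) \sim \deg(v)$, then

$$\hat{f}_C^{vol} = \frac{1}{n} \sum_{v \in S} 1_{\{v \in C\}}. \qquad (32)$$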



Star sampling 

Another approach is to focus on the set of all neighbors
$\mathcal{N}(S)$ of sampled nodes (with repetitions) rather than on S
itself, i.e., to use 'star sampling' [23]. The probability that a node
v is a neighbor of a node sampled from V by UIS is

$$\sum_{u \in V} \frac{1}{N} \cdot 1_{\{v \in \mathcal{N}(u)\}} = \frac{\deg(v)}{N}.$$

Consequently, the nodes in $\mathcal{N}(S)$ are asymptotically
equivalent to nodes drawn with probabilities linearly proportional to
node degrees. By applying Eq. (32) to $\mathcal{N}(S)$, we obtain

$$\hat{f}_C^{vol} = \frac{1}{vol(S)} \sum_{u \in S} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}}, \qquad (33)$$

where we used $|\mathcal{N}(S)| = \sum_{u \in S} \deg(u) = vol(S)$.

In the more general case, when S is selected using WIS, we correct for
the linear bias towards nodes of higher weights $w()$, as follows:

$$\hat{f}_C^{vol} = \frac{1}{\sum_{u \in S} \deg(u) / w(u)} \sum_{u \in S} \left( \frac{1}{w(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}} \right). \qquad (34)$$

In particular, if $w(v) \sim \deg(v)$, then

$$\hat{f}_C^{vol} = \frac{1}{n} \sum_{u \in S} \frac{1}{\deg(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}}. \qquad (35)$$

Note that for every sampled node $v \in S$, the formulas Eq. (33-35)
exploit all the $\deg(v)$ neighbors of v, whereas Eq. (30-32) rely on
one node per sample only. Not surprisingly, Eq. (33-35) performed much
better in all our simulations and implementations.
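The difference between the naive estimator Eq. (32) and the neighbor-based estimator Eq. (35) can be sketched in Python as follows. Both functions assume that the sample was drawn with probabilities proportional to node degrees (e.g., by a RW in equilibrium). The toy graph, the sample and the function names are hypothetical.

```python
def naive_estimator(sample, category, target):
    """Eq. (32): fraction of sampled nodes that belong to the target category."""
    return sum(category[v] == target for v in sample) / len(sample)


def neighbor_estimator(sample, adj, category, target):
    """Eq. (35): average, over sampled nodes, of the fraction of neighbors in the category."""
    total = 0.0
    for u in sample:
        hits = sum(category[v] == target for v in adj[u])
        total += hits / len(adj[u])
    return total / len(sample)


# Hypothetical toy graph and a degree-proportional sample (with repetitions).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
category = {0: "A", 1: "A", 2: "B", 3: "B"}
sample = [2, 0, 2, 1]

f_naive = naive_estimator(sample, category, "B")
f_neigh = neighbor_estimator(sample, adj, category, "B")
```

The naive estimator uses one category label per sampled node, while the neighbor-based one uses the labels of all its neighbors, which is the source of its higher efficiency.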

^As a side note, observe that formula Eq. (33) generalizes the
"scale-up method" [9] used in social sciences to estimate the size
(here $|C|$) of hidden populations (e.g., of drug addicts). Indeed, if
we assume that the average node degree in V is the same as in C, then
$vol(C)/vol(V) = |C|/N$, which reduces Eq. (33) to the core formula of
the scale-up method.



Appendix C: Relative sizes of node categories 

Consider a scenario with only two node categories, i.e.,
$\mathcal{C} = \{C_1, C_2\}$. Denote $f_1 = |C_1|/N$ and
$f_2 = |C_2|/N$. The goal is to estimate $f_1$ and $f_2$ based on the
collected sample S.

UIS - Uniform independence sampling. 

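Under UIS, the number $X_1$ of times we select a node from $C_1$ among
n attempts follows the Binomial distribution
$X_1 \sim Binom(f_1, n)$. Therefore, we can estimate $f_1$ as

$$\hat{f}_1^{UIS} = \frac{X_1}{n} \quad \text{with} \quad V(\hat{f}_1^{UIS}) = \frac{f_1 f_2}{n}. \qquad (36)$$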



WIS - Weighted independence sampling. 

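In contrast, under WIS, at every iteration the probability $\pi(v)$ of
selecting a node v is:

$$\pi(v) = \pi_1 = \frac{w_1}{w_1 |C_1| + w_2 |C_2|} \quad \text{if } v \in C_1, \text{ and}$$

$$\pi(v) = \pi_2 = \frac{w_2}{w_1 |C_1| + w_2 |C_2|} \quad \text{if } v \in C_2,$$

where $w_1$ and $w_2$ are the weights $w(v)$ of nodes in $C_1$ and
$C_2$, respectively.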

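By applying the Hansen-Hurwitz estimator (separately for the numerator
and the denominator), we obtain

$$\hat{f}_1^{WIS} = \frac{\widehat{|C_1|}}{\hat{N}}
  = \frac{\sum_{v \in S} 1_{\{v \in C_1\}} / \pi(v)}{\sum_{v \in S} 1 / \pi(v)}
  = \frac{X_1 / \pi_1}{X_1 / \pi_1 + (n - X_1) / \pi_2}
  = \frac{X_1 \cdot \pi_2}{X_1 (\pi_2 - \pi_1) + n \cdot \pi_1}
  = \frac{X_1 \cdot w_2}{X_1 (w_2 - w_1) + n \cdot w_1}, \qquad (37)$$

where $X_1$ is the number of samples taken from $C_1$. Note that to
calculate $\hat{f}_1^{WIS}$ we only need the values $w_1$ and $w_2$,
which are set by us and thus known.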

Computing the variance of /i^'^ is a bit more challeng- 
ing. We use the second-order Taylor expansions (the 'Delta 
method') to approximate it as follows: 



a/7 



nw\W2 



dXi 



i{w2 — wi)Xi + nwi)^ ' 
2 



and 



dXi 



(E(XO) j V(Xi) 
-^^^ ■(/1M1 + /2W2)'. (38) 



In the above derivation, we used the fact that $E(X_1) = 
n N f_1 \pi_1$ and $V(X_1) = n N^2 f_1 \pi_1 f_2 \pi_2$. This comes from the 
fact that $X_1$ actually follows the binomial distribution $X_1 \sim 
\mathrm{Binom}(N f_1 \pi_1, n)$. 
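
To get a feel for how tight the approximation in Eq. (38) is, the sketch below compares it with the exact variance of the estimator of Eq. (37), obtained by summing over the binomial distribution of $X_1$. This is only an illustrative check; the function names and the parameter values ($f_1 = 0.2$, $n = 1000$, $w_1 = 4$, $w_2 = 1$) are assumptions chosen for the example.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial pmf evaluated in log space to avoid overflow (needs 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def delta_variance(f1: float, n: int, w1: float, w2: float) -> float:
    """Delta-method approximation of the WIS variance, Eq. (38)."""
    f2 = 1.0 - f1
    return f1 * f2 * (f1 * w1 + f2 * w2) ** 2 / (n * w1 * w2)


def exact_variance(f1: float, n: int, w1: float, w2: float) -> float:
    """Exact variance of the Eq. (37) estimator, summing over X_1 = 0..n."""
    f2 = 1.0 - f1
    p = f1 * w1 / (f1 * w1 + f2 * w2)   # probability of drawing from C_1, N*f1*pi1
    mean = second = 0.0
    for k in range(n + 1):
        est = k * w2 / (k * (w2 - w1) + n * w1)
        prob = binom_pmf(k, n, p)
        mean += prob * est
        second += prob * est * est
    return second - mean * mean


approx = delta_variance(0.2, 1000, 4.0, 1.0)
exact = exact_variance(0.2, 1000, 4.0, 1.0)
```

For a sample of this size the two values should differ only marginally, since the neglected Taylor terms shrink with $n$.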

For $w_1 = w_2$, we are back in the UIS case. But this is not 
necessarily the optimal choice of weights. Indeed, a quick 
application of Lagrange multipliers reveals that $V(\hat{f}_1^{\mathrm{WIS}})$ is 
minimized when 

$$w_1 f_1 = f_2 w_2. \qquad (39)$$



Moreover, analogous analysis shows that Eq. (39) minimizes 
$V(\hat{f}_2^{\mathrm{WIS}})$ as well. In other words, the estimators of both $f_1$ 
and $f_2$ have the lowest variance if the total weighted mass 
of $C_1$ is equal to that of $C_2$. This implies, in expectation, 
equal allocation of samples between $C_1$ and $C_2$, i.e., 

$$w_i^{\mathrm{WIS}} \propto \frac{1}{|C_i|}.$$



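Finally, we can use Eq. (36), Eq. (38) and Eq. (39) to 
calculate the gain $\alpha$ of WIS over UIS: 

$$\alpha = \frac{V(\hat{f}_1^{\mathrm{UIS}})}{V(\hat{f}_1^{\mathrm{WIS}})} = \frac{1}{4 f_1 f_2} \quad (\geq 1). \qquad (40)$$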

Note that we always have $\alpha \geq 1$ (with equality only for $f_1 = f_2$), and $\alpha$ grows quickly with 
growing difference between $f_1$ and $f_2$. 
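
The gain in Eq. (40) can be reproduced numerically by plugging weights that satisfy Eq. (39) into Eq. (38) and dividing the UIS variance of Eq. (36) by the result. The sketch below does this for an assumed category size $f_1 = 0.1$; the function names and the specific values are illustrative only.

```python
def uis_variance(f1: float, n: int) -> float:
    """Variance of the UIS estimator, Eq. (36)."""
    return f1 * (1.0 - f1) / n


def wis_variance(f1: float, n: int, w1: float, w2: float) -> float:
    """Delta-method variance of the WIS estimator, Eq. (38)."""
    f2 = 1.0 - f1
    return f1 * f2 * (f1 * w1 + f2 * w2) ** 2 / (n * w1 * w2)


def gain(f1: float) -> float:
    """Closed-form gain of optimally weighted WIS over UIS, Eq. (40)."""
    return 1.0 / (4.0 * f1 * (1.0 - f1))


f1, n = 0.1, 1000
w1, w2 = 1.0 / f1, 1.0 / (1.0 - f1)   # one choice satisfying w1*f1 = w2*f2, Eq. (39)
alpha = uis_variance(f1, n) / wis_variance(f1, n, w1, w2)

# Moving away from the optimal weight ratio in either direction increases the variance.
worse_up = wis_variance(f1, n, 2.0 * w1, w2)
worse_down = wis_variance(f1, n, 0.5 * w1, w2)
```

Only the ratio $w_1/w_2$ matters here, so any common rescaling of the two weights leaves the variance unchanged.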

Appendix D: Optimal WRW weights in Fig. 3(a) 

Every time WRW visits the white node/category in Fig. 3(a), 
the next node is chosen uniformly from the red and green 
categories. We stay in this selected category for $k$ rounds, 
where $k$ is a geometric random variable with parameter 
$p = w_2/(w_1 + w_2) \in [0, 1]$. Next, we come back to the white 
category, and reiterate the process. So the number $n_{red}$ of 
times the red category is sampled is 

$$n_{red} = \sum_{1}^{\mathrm{Binom}(0.5,\, n_{wh})} \mathrm{Geom}(p),$$

where $n_{wh}$ is the number of visits to the white category. 
Because the random variables generated by $\mathrm{Binom}(0.5, n_{wh})$ 
and $\mathrm{Geom}(p)$ are independent, we can write 

$$E[n_{red}] = E[\mathrm{Binom}(0.5, n_{wh})] \cdot E[\mathrm{Geom}(p)] = 0.5\, n_{wh}/p,$$
$$V[n_{red}] = E[\mathrm{Binom}()] \cdot V[\mathrm{Geom}()] + E^2[\mathrm{Geom}()] \cdot V[\mathrm{Binom}()].$$

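A possible unbiased estimator of the relative size $f_{red}$ of the red 
category (among relevant categories) is 

$$\hat{f}_{red} = \frac{n_{red}}{n_{wh}/p},$$

for which we get 

$$E[\hat{f}_{red}] = \frac{E[n_{red}]}{n_{wh}/p} = \frac{1}{2} \quad \text{(unbiased)},$$
$$V[\hat{f}_{red}] = \frac{V[n_{red}]}{(n_{wh}/p)^2} = \frac{3 - 2p}{4\, n_{wh}}.$$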

This variance is expressed as a function of $n_{wh}$, and not of the 
total sample length $n$. However, note that $n_{wh}$ drops with 
decreasing $p$. Consequently, the variance $V[\hat{f}_{red}]$ (expressed 
as a function of $n_{wh}$ or of $n$) is minimized for $p = 1$, i.e., for 
$w_1 = 0$ and $w_2 > 0$ (and $n_{wh} = n/2$). 
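
The variance formula above can be checked with a small simulation of the process described in this appendix. The sketch below is an illustrative model only: it assumes that each visit to the white category picks red with probability 0.5 and then stays for a geometric number of rounds counted from 1 (so that the mean stay is $1/p$); the function names and parameter values are hypothetical.

```python
import math
import random


def var_n_red(n_wh: int, p: float) -> float:
    """V[n_red] for a sum of Binom(0.5, n_wh) independent Geom(p) terms."""
    e_binom, v_binom = 0.5 * n_wh, 0.25 * n_wh
    e_geom, v_geom = 1.0 / p, (1.0 - p) / p ** 2
    return e_binom * v_geom + e_geom ** 2 * v_binom


def var_f_red(n_wh: int, p: float) -> float:
    """Closed-form variance of the estimator: (3 - 2p) / (4 n_wh)."""
    return (3.0 - 2.0 * p) / (4.0 * n_wh)


def geometric(rng: random.Random, p: float) -> int:
    """Number of rounds until the first success (support 1, 2, ...)."""
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()              # uniform on (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(1.0 - p))


def simulate_f_red(n_wh: int, p: float, rng: random.Random) -> float:
    """One realization of the estimator n_red / (n_wh / p)."""
    n_red = 0
    for _ in range(n_wh):
        if rng.random() < 0.5:          # this white visit leads to the red category
            n_red += geometric(rng, p)
    return n_red / (n_wh / p)
```

Averaging `simulate_f_red` over many runs should give a mean close to 1/2 and an empirical variance close to `var_f_red(n_wh, p)`, which decreases as $p$ approaches 1 for a fixed $n_{wh}$.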



